1. INTRODUCTION

Accidents and disasters, whether natural or man-made, are inevitable occurrences around the world,
leaving people injured, missing, or dead. These events can cause great damage to people's lives and
property. Social media platforms have gained much popularity in recent decades as a way of connecting
people irrespective of geographical location. They also serve as media for disseminating information, a
role previously restricted to television (TV) and radio outlets. For this reason, and given the
availability of ubiquitous internet-enabled mobile devices, these platforms can contribute to the process of
rescuing people. Big data frameworks make it possible to analyze large volumes of differently structured
data and to draw inferences from them when building applications.

In the case of natural disasters, major advances in meteorological devices and sensors provide
opportunities for these occurrences to be anticipated, so that their effect on lives can be considerably
reduced. In the short period following the impact of a disaster, the time taken by government officials,
relevant agencies, and institutions to respond and provide relief to the affected population is vital.
Between 1/1/2016 and 11/11/2016, there were 1985 (659 of these were floods, which triggered 572 whirlwinds
and 485 landslides) natural disasters recorded in Indonesia, which was the highest record in the previous 10
years [1]. These events affected more than two and a half million people, causing more than 300 deaths and
more than 300 injuries.

Studies show heavy usage of social media, especially during crises [2]-[10]. The earliest example
of using social media at the time of a disaster was on the 7th of July 2005 in London, UK. When
coordinated suicide attacks hit London, citizens uploaded videos and images to the
above-mentioned sites to share the reality with other people [11]. The number of tweets grows significantly
during a crisis. For example, during Hurricane Sandy in 2012, there were 20 million tweets posted on
Twitter. During the Great East Japan Earthquake in 2011, there were 33 million tweets posted. During the
2010 Haiti earthquake, a total of 3.28 million tweets were posted. A final example is Typhoon Haiyan in
2013, during which 5.72 million tweets were posted [12]-[21].

Social media has three major uses in disaster response. The first is disaster event discovery.
This essentially relies on an online monitoring system that detects certain keywords when their use
grows on the web [4], [22]-[24]. The second is asking for help at the time of a disaster. Although
social media plays a role in modern disasters and has been used to share calls for rescue, there are
still essential issues with rescue efforts that rely on social media. Distinguishing the most urgent
calls among the requests is a great challenge [25]. The third is detecting risk and damage events.
Many examples have been reported, such as earthquake strength level charts and flood inundation
detection [26], [27]. The implementation of such systems relies on locals to provide real-time
information about the disaster situation through social media websites [28].

Widely used social media platforms such as Facebook, Twitter, and Instagram can be vital in
providing information that improves the safety of these victims. During disasters and their
aftermath, there is usually a large amount of information on such platforms about missing, injured, or
deceased people. Data analysts and engineers can extract such information using natural language
processing (NLP) tools. Although the results are not 100% precise (due to unstructured data and human
error), they provide a treasure trove of timely information.


2. RESEARCH GOAL 

The purpose of this research is mainly to implement a plan to be used at the time of natural disasters.
We achieve this by developing an application that provides timely information assistance in the event
of a disaster. Such information will be used to aid people in need of relief and assistance and also
to inform authorities and concerned organizations so that they can take action. This paper aims to
achieve the following goals: (i) retrieve and analyze, in real time, datasets from social media platforms and
(ii) provide reports inferred from these datasets that can be used for relief efforts.


3. METHODOLOGY 
This research covers several components. The methodology is described in the following subsections.


3.1. Use-case 

For this use-case, we use the Twitter representational state transfer (REST) application
programming interface (API) to download tweets through an authenticated application, analyze the
tweets, and retrieve information about people who are missing or in danger. This work presents the
following in an application: an interface, location data from the tweets, and a filter to reduce the
number of unimportant tweets captured by the system. Figure 1 shows the use-case workflow diagram.

Figure 1 shows the three main processes that make up this work: data filtering, text analysis, and
result analysis and classification. The data filtering process cleans the data of unnecessary items
such as stop words and links. The text analysis phase analyzes the structure of the sentences.
Finally, the result analysis phase captures the sentences that are considered important by matching
grammatical rules and keywords.


3.2. Interface 

The interface was developed with the PyQt library. PyQt is a toolkit for building graphical user
interfaces in the Python programming language. It offers options for creating custom interfaces as
well as collections of components for specific applications.


3.3. Acquiring the data 

Twitter was used as the data source for retrieving information that can be vital in providing help
to persons in need. We focused on retrieving information on missing people. The first step was to
understand the context of tweets. The next step was to categorize tweets about missing people. To
implement this, we retrieved and categorized real-time tweets that focused on missing people as a
point of concern. The captured tweets were filtered with the essential keyword "missing".
The Twitter API provides access to public discourse and communication on any subject, or
combination of subjects, described by hashtags (#) or keywords. Publicly available information and
conversation includes (but is not limited to) user data, user status updates, and retweets. Such conversations
and information can be streamed and displayed on a screen or analyzed for further inferences. Although the
API exposes only about 1% of Twitter feeds on various topics, it is still a large source
of data. To access the tweets, an authenticated connection is made through OAuth (an open
standard for authorization), and datasets can be retrieved in formats such as JavaScript object notation
(JSON) and extensible markup language (XML).

The Tweepy library in Python is used to make the OAuth connection to the Twitter API. Credentials
for authenticating users, however, have to be obtained by registering an application with an enrolled
account on the social media platform [29]. Figure 2 shows part of the data retrieved from the API.
Only the portion of the tweet block specific to our analysis (the 'text' and 'location' fields) was
retrieved. This significantly reduces the overhead of data streamed over the network, as well as the
amount of data that goes through the filtering process.
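
The reduction of a tweet block to its 'text' and 'location' fields can be sketched as follows. This is a minimal illustration, not the application's actual code: the function names are hypothetical, the network and OAuth steps are left out, and it assumes the tweet arrives as a JSON object in which the text is stored under `text` and the location is the user profile location under `user.location`.

```python
import json


def summarize_tweet(raw_json: str) -> str:
    """Keep only the location and text of one tweet (hypothetical helper)."""
    tweet = json.loads(raw_json)
    text = tweet.get("text", "")
    # Assumption: the location is the profile location of the posting user.
    user = tweet.get("user") or {}
    location = user.get("location")
    return f"Location: {location}, Tweet: {text}"


def mentions_keyword(text: str, keyword: str = "missing") -> bool:
    """Case-insensitive check for the keyword used to capture tweets."""
    return keyword.lower() in text.lower()


# Example with a hand-written tweet object (not real API output).
raw = json.dumps({
    "id": 1,
    "text": "Vidor ISD police search for missing high school student",
    "user": {"location": "Beaumont, TX", "name": "example"},
})
if mentions_keyword(json.loads(raw)["text"]):
    print(summarize_tweet(raw))
```

A tweet whose profile location is empty is rendered with `Location: None`, which matches the form of the entries in Figure 2.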












[Diagram stages: Client Authentication, Data Streaming, Data Filtering]

Figure 1. Workflow diagram


[
"Location: Weymouth, Tweet: Human remains discovered in Barton-On-Sea are of missing
person Isobel Munro https://t.co/kDLcpq9Aev"

"Location: None, Tweet: RT @niftyvibe: im really missing you"

"Location: bay area, Tweet: damn well ain't tryna catch up on all my missing assignments
because of this fieldtrip"

"Location: None, Tweet: RT @MVPKENZ: If you gonna attempt to come for me make sure you
come correct...throwing shots &amp; missing\ud83d\udc4e\ud83c\udffe"

"Location: Savannah, Georgia, Tweet: 'There were no red flags,' say police 'Horrific
Ending' After Mom, 2 Sons Reported Missing\u2026 https://t.co/dI9mgrBxwl
https://t.co/s5bs4FxCm3"
]

Figure 2. Sample of retrieved tweets with their locations


3.4. Filtering the data 
The tweets were cleaned and filtered after retrieval. The filtering process involved removing
retweets from the retrieved dataset. This was done because retweets usually consist of
duplicated information about a particular person or incident. Cleaning the dataset involved removing
hyperlinks, incomprehensible words, and words that might interfere with the context of the
sentence. The resulting output is stored in a text file, with one tweet per line. Figure 3 shows the
data after the cleaning and filtering operations.
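
A minimal sketch of the retweet and hyperlink steps is given below. The function names and regular expressions are illustrative assumptions rather than the application's actual rules: a retweet is assumed to start with `RT @user:`, as in Figure 2, and a hyperlink is assumed to start with `http://` or `https://`. Removal of incomprehensible words is not covered.

```python
import re
from typing import Iterable, List

# Assumed patterns; the rules used by the application are not specified.
URL_RE = re.compile(r"https?://\S+")
RETWEET_RE = re.compile(r"^RT\s+@\w+:")


def is_retweet(text: str) -> bool:
    """Treat a tweet as a retweet when it starts with 'RT @user:'."""
    return RETWEET_RE.match(text.strip()) is not None


def clean_tweet(text: str) -> str:
    """Remove hyperlinks and collapse the whitespace left behind."""
    without_links = URL_RE.sub("", text)
    return " ".join(without_links.split())


def filter_and_clean(tweets: Iterable[str]) -> List[str]:
    """Drop retweets, then clean every remaining tweet."""
    return [clean_tweet(t) for t in tweets if not is_retweet(t)]


tweets = [
    "Human remains discovered in Barton-On-Sea are of missing person "
    "Isobel Munro https://t.co/kDLcpq9Aev",
    "RT @niftyvibe: im really missing you",
]
# One cleaned tweet per line, as in the stored text file.
print("\n".join(filter_and_clean(tweets)))
```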


[
"Location: Weymouth, Tweet: Human remains discovered in Barton-On-Sea are of missing
person Isobel Munro",

"Location: bay area, Tweet: damn well ain't tryna catch up on all my missing assignments
because of this field trip",

"Location: Savannah, Georgia, Tweet: 'There were no red flags,' say police 'Horrific
Ending' After Mom, 2 Sons Reported missing",

"Location: Beaumont, TX, Tweet: Vidor ISD police search for missing high school
student",

"Location: kirkpatrick Fleming, Tweet: Struggling to avoid all the #missing posts"
]


Figure 3. Sample of data after filtration


3.5. Text analysis 

Analyzing every row of tweets required some form of semantic analysis of each tweet. A
categorizer was implemented using the natural language toolkit (NLTK) library [29]-[31] to separate
tweets that discussed the security of people from those that were not important to the conversation
on security. The categorization process includes line division, stop-word removal, part of speech
(POS) categorization, and POS annotation.

Line division (tokenization) involved dividing the 'tweet' section of each tweet block into
separate words. Stop-word elimination involved removing words that had little or inconsequential
meaning in the context of each tweet, to ensure faster parsing. The list of stop-words includes
'from', 'in', 'out', 'and', 'on', 'a', 'of', 'those', and 'this'. Other symbols and signs (such as ';'
and '@') were also removed. Words containing special characters that may have altered their meaning
were deleted as well. The words are then annotated with their parts of speech. A customized
classifier was implemented by combining part of speech annotations. This is helpful in recognizing
the context of tweets. The tweets that include the desired information are selected, and names (with
additional desired specifics) are gathered in chunks.
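
Line division and stop-word removal can be sketched with the Python standard library as follows. This is a simplified stand-in for the NLTK-based steps, with hypothetical function names; the token pattern is an assumption that keeps hyphenated names such as Barton-On-Sea together and drops symbols such as ';' and '@'. Only the stop-words listed above are used.

```python
import re
from typing import List

# Stop-words named in the text; the application's full list may be longer.
STOP_WORDS = {"from", "in", "out", "and", "on", "a", "of", "those", "this"}

# Assumed token pattern: letters and digits, optionally joined by '-' or "'".
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def tokenize(text: str) -> List[str]:
    """Divide the text of a tweet into separate words, dropping symbols."""
    return TOKEN_RE.findall(text)


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


text = "Human remains discovered in Barton-On-Sea are of missing person Isobel Munro"
print(remove_stop_words(tokenize(text)))
```

The remaining tokens would then be passed to a part of speech tagger before chunking.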


4. RESULTS 

The results obtained after grouping each word according to its POS are used to categorize
and filter sentences according to the classifiers built from the part of speech annotations. The
findings can be displayed, as shown in Figure 4, or written to a file. Groups that include relevant
information are organized into chunks. Matplotlib [32], [33] can be used to present a visual report
of these chunks describing the format of each sentence. Figure 4 presents an example of a chunk
representing details of a missing person.
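
The idea of gathering a chunk from POS-annotated words can be sketched as below. The grammar rule is a hypothetical one chosen for illustration (the keyword "missing", then optional adjectives, then one or more nouns); the rules actually used by the classifier are not given here. The sketch uses only the standard library and hand-supplied Penn Treebank style tags; in NLTK, a tag-pattern grammar passed to `nltk.RegexpParser` is the usual way to express such a rule.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, POS tag)


def find_missing_chunks(tagged: List[TaggedToken],
                        keyword: str = "missing") -> List[List[TaggedToken]]:
    """Collect chunks of the form: keyword, adjectives (JJ*), nouns (NN*)."""
    chunks = []
    i = 0
    while i < len(tagged):
        if tagged[i][0].lower() == keyword:
            j = i + 1
            # Optional adjectives directly after the keyword.
            while j < len(tagged) and tagged[j][1].startswith("JJ"):
                j += 1
            noun_start = j
            # At least one noun is required to close the chunk.
            while j < len(tagged) and tagged[j][1].startswith("NN"):
                j += 1
            if j > noun_start:
                chunks.append(tagged[i:j])
                i = j
                continue
        i += 1
    return chunks


# Tagged words adapted from the example in Figure 4.
sentence = [("missing", "VBG"), ("12-year-old", "JJ"), ("San", "NNP"),
            ("Antonio", "NNP"), ("girl", "NN"), ("found", "VBD")]
for chunk in find_missing_chunks(sentence):
    print(" ".join(word for word, _ in chunk))
```

A rule this simple also matches phrases such as "missing assignments", which is the kind of non-relevant match that later has to be separated out.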


Chunk        found VBD        as IN


missing VBG    12-year-old JJ    San NNP    Antonio NNP    girl NN


Figure 4. Visual illustration of a chunk

For confirmation, a file of the created chunks is generated. The file was subsequently filtered to
remove the part of speech tags and improve readability. A sample of the sentences and their chunks is
shown in Figure 5. A sequence of tests was conducted to verify the effectiveness of the application.
The resulting output is shown in Table 1.

Table 1 presents the results of running the application on the sample dataset and on the full
dataset. The first row relates to the sample dataset (107 tweets), while the second row gives the
results for the full dataset (527 tweets). The "total tweets" column gives the total number of tweets
retrieved from the Twitter API and analyzed, which constitutes the dataset. The "total
security-relevant tweets" column gives the number of tweets that are genuinely security tweets, which
are what we aim to find in this study. These tweets were counted by manual investigation in order to
measure the effectiveness of the system. The "classified security tweets" column gives the number of
tweets that the system found and categorized as "important". The "classified non-security-relevant
tweets" column gives the number of tweets that resemble security tweets, in that they match the
grammar conditions and contain the keywords, but are not actually important. For example, a sentence
is not important when the keyword is used as a figure of speech. Finally, the "accuracy percentage"
column represents the accuracy of the system in finding relevant tweets. For example, on the sample
dataset (first row of Table 1) the accuracy is 100% because the system captured all 3 relevant tweets
among the 107 tweets.
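
The accuracy percentages in Table 1 are consistent with dividing the number of classified security tweets by the total number of security-relevant tweets. A small sketch of that arithmetic follows; the function name is hypothetical and the formula is inferred from the table values rather than stated by the application.

```python
def accuracy_percentage(classified_security: int, total_security_relevant: int) -> float:
    """Share of the manually counted relevant tweets that the system found."""
    if total_security_relevant <= 0:
        raise ValueError("total_security_relevant must be positive")
    return round(100.0 * classified_security / total_security_relevant, 2)


print(accuracy_percentage(3, 3))    # sample dataset, 107 tweets
print(accuracy_percentage(31, 36))  # full dataset, 527 tweets
```

Computed this way, the figure reflects how many of the relevant tweets were found and does not take the "classified non-security-relevant tweets" column into account.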


[
"(Location: Weymouth Tweet: Human remains discovered in Barton-On-Sea are (Chunk
missing person Isobel Munro))"

"(Location: Weymouth Tweet: Human remains discovered in (Chunk Barton-On-Sea are
missing) person Isobel Munro)"

"(Location: Savannah Georgia Tweet: 'There were no red flags' say police 'Horrific
Ending' After Mom 2 (Chunk Sons Reported missing))"

"(Location: Beaumont TX Tweet: Vidor ISD police search for (Chunk missing high school
student))"

"(Location: kirkpatrick (Chunk Fleming Tweet: Struggling) to avoid all missing posts)"

"(Location: earth Tweet: happy 27th birthday (Chunk missing one best nights) my life
tylerrjoseph)"
]


Figure 5. Sentences of chunks


Table 1. Functionality testing


Total     Total security-    Classified         Classified non-security-    Accuracy
tweets    relevant tweets    security tweets    relevant tweets             percentage
107       3                  3                  30                          100
527       36                 31                 127                         86.11


5. DISCUSSION 

To run the verification, the tweets related to security issues were counted manually before the
dataset was run through the classifier. The application then filtered the tweets, separating security
tweets from non-security tweets. The first iteration was run on the sample dataset of 107 tweets and
gave an accuracy of 100%. The second iteration was run on the full dataset of 527 tweets, as shown in
Table 1, and resulted in an accuracy of 86.11%, with the output still containing some non-security
tweets. Two datasets of different sizes were used to ensure fast performance while implementing the
application and to check the effectiveness of its results. The results illustrate the fundamental
challenge usually experienced when processing natural language, more so in written form. For better
results, the classifier needs to be improved. The results also underline the need for human input,
for activities such as verification, in an application that concerns sensitive issues such as security
and disasters.


6. CONCLUSION

To conclude, this paper draws a roadmap for implementing a disaster management assistance plan
based on using technology to save people's lives. Such an approach can help to reduce inefficiency
during natural disasters and enable teams to focus on important cases. As this paper has tried to
establish, much can be obtained from the information circulating on social media platforms during
disasters. Much lies in the potential of gleaning information from such datasets using tools that
process and analyze big data and natural language effectively. While these technologies are steadily
improving through research and development, there is no harm in using what can be obtained from the
tools that are available at present.

In this paper, an approach for analyzing tweets from the Twitter platform using Spark and text
analysis was presented. Since we are working with unstructured and informal language from the data
source (Twitter), it is still a great challenge to receive relevant data from an infinite stream and
to understand it. The names of persons who require help are categorized into chunks and stored in
text files for further manual verification or visual presentation (using plots). This supports a plan
to rescue people in danger during natural disasters using the power of technology.